Spark on k8s: changes for using namespaces with Kubernetes client APIs. #462

Open · wants to merge 3 commits into master
Conversation


@ashokkumarrathore commented Jan 1, 2025

What changes were proposed in this pull request?

Related issue: #461

How was this patch tested?

Tested by deploying the updated server in a k8s cluster with namespaces and running a job.

@ashokkumarrathore changed the title from "Livy nmsp fix" to "Spark on k8s: changes for using namespaces with Kubernetes client APIs." Jan 1, 2025
@@ -138,6 +138,7 @@ object SparkKubernetesApp extends Logging {
private var sessionLeakageCheckInterval: Long = _

var kubernetesClient: DefaultKubernetesClient = _
var appNamespaces: mutable.Set[String] = mutable.Set("default")
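For context, a minimal editorial sketch (not part of the diff) of how such a namespace set might be consumed. listTrackedPods is a hypothetical helper; pods().inNamespace(...).list() is the actual fabric8 client API.

import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.DefaultKubernetesClient
import scala.collection.JavaConverters._

// Hypothetical helper: list pods in every namespace Livy has submitted to,
// rather than only the client's single configured namespace.
def listTrackedPods(client: DefaultKubernetesClient,
                    namespaces: Set[String]): Seq[Pod] =
  namespaces.toSeq.flatMap { ns =>
    client.pods().inNamespace(ns).list().getItems.asScala
  }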
Contributor

Why do we need to have a default namespace at all? Can it be empty initially? Also, is it thread safe (or can we make it so)?

Contributor

Did you consider making the allowed namespaces list configurable? We know it upfront anyway when configuring permissions for the Livy service account. It's worth checking how the recovery works, I haven't looked at it for a while, but having Livy monitor all the namespaces from the get-go may save us from leaked/unrecovered apps.

Contributor

Then we also would not need to propagate the app namespace everywhere.
WDYT?
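A minimal sketch of what such a static allowlist could look like. The configuration key is invented for illustration (Livy does not define it), and a real implementation would read it from LivyConf rather than system properties.

// Hypothetical: "livy.server.kubernetes.allowedNamespaces" is an invented key,
// read from system properties here purely for illustration.
val allowedNamespaces: Set[String] =
  sys.props.getOrElse("livy.server.kubernetes.allowedNamespaces", "default")
    .split(",").map(_.trim).filter(_.nonEmpty).toSet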

Author

Thank you for the inputs, @jahstreet.

  1. If the user does not specify a namespace, the job goes to the default namespace; that's why we start with it. But we can remove it too, as it will get populated after the first job submission anyway. For thread safety, I will switch it to a set backed by java.util.concurrent.ConcurrentHashMap (see the sketch after this list).
  2. The configurable namespace allowlist would work, but it would need a redeployment every time a new namespace is allowed. Role bindings can be created in new namespaces (to allow job submission from the Livy service account) independently.
  3. The reason we are maintaining this list of namespaces is leaked apps: the idea is that going through all namespaces where jobs were submitted should be enough.
  4. Livy monitoring all namespaces is not feasible for security reasons: in a multi-tenant cluster, the Livy service account might not have any permissions on some namespaces.
  5. For recovery, we are now also saving the namespace in server/src/main/scala/org/apache/livy/server/batch/BatchSession.scala, which will be used to recover.
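A minimal sketch of the thread-safety change mentioned in item 1, assuming a ConcurrentHashMap-backed key set replaces the plain mutable.Set; the helper names are hypothetical.

import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Thread-safe replacement for `mutable.Set("default")`:
// ConcurrentHashMap.newKeySet() yields a concurrent java.util.Set.
val appNamespaces: java.util.Set[String] = ConcurrentHashMap.newKeySet[String]()

// Record a namespace at submission time; safe to call from many threads.
def registerNamespace(ns: String): Unit = appNamespaces.add(ns)

// Snapshot for the monitoring loop; iteration over the key set is weakly
// consistent, so concurrent additions never throw.
def trackedNamespaces: Set[String] = appNamespaces.asScala.toSet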

Contributor

Thanks, it covers most of my thoughts 👍.

In case namespaces are dynamic, should we also truncate the set of watched namespaces, to handle revoked access rights and namespace-deletion edge cases?

Author

That's a good point. The challenge is deciding when to truncate. One condition I can think of is just after deleting all leaked apps, though in a rare case this might still leave some leaked apps (apps that were just added to the leaked list). I will see what other options we have (a sketch follows below).

I still need to write some unit tests, but before that I want to make sure the existing tests pass. How do I get this workflow approved by a maintainer?
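A sketch of the truncation idea discussed above, run right after a leaked-app cleanup pass; appNamespaces is the concurrent set from the earlier sketch and namespacesStillInUse is a hypothetical input.

import scala.collection.JavaConverters._

// Hypothetical pruning pass: keep only namespaces that still host
// Livy-tracked apps, dropping deleted or no-longer-accessible ones.
def pruneNamespaces(namespacesStillInUse: Set[String]): Unit =
  appNamespaces.retainAll(namespacesStillInUse.asJava)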

Contributor

@gyogal, can you help here please?

Contributor
@gyogal commented Jan 27, 2025

Thank you @ashokkumarrathore for the PR and many thanks also to @askhatri, @jahstreet for the feedback! The unit test executor is currently broken and I am in the process of fixing it. As soon as it is working again, I will start the check jobs on this PR.
